Sparse Group Restricted Boltzmann Machines 



Heng Luo Ruimin Shen 

Department of Computer Science Department of Computer Science 

Shanghai Jiao Tong University Shanghai Jiao Tong University 

hengluo@sjtu.edu.cn rmshen@sjtu.edu.cn 

Changyong Niu 

Department of Computer Science 
Zhengzhou University 

cyniu@sjtu.edu.cn 



Abstract 

Since learning is typically very slow in Boltzmann machines, there is a need to 
restrict connections within hidden layers. However, the resulting states of hidden 
units exhibit statistical dependencies. Based on this observation, we propose using 
$l_1/l_2$ regularization upon the activation probabilities of hidden units in restricted 
Boltzmann machines to capture the local dependencies among hidden units. This 
regularization not only encourages hidden units of many groups to be inactive 
given observed data but also makes hidden units within a group compete with 
each other for modeling observed data. Thus, the $l_1/l_2$ regularization on RBMs 
yields sparsity at both the group and the hidden unit levels. We call RBMs trained 
with the regularizer sparse group RBMs. The proposed sparse group RBMs are 
applied to three tasks: modeling patches of natural images, modeling handwritten 
digits and pretraining a deep network for a classification task. Furthermore, we 
illustrate that the regularizer can also be applied to deep Boltzmann machines, which 
leads to sparse group deep Boltzmann machines. When adapted to the MNIST data 
set, a two-layer sparse group Boltzmann machine achieves an error rate of 0.84%, 
which is, to our knowledge, the best published result on the permutation-invariant 
version of the MNIST task. 



1 Introduction 



Restricted Boltzmann Machines (RBMs) [1,2] have recently become very popular because of their 
excellent ability in unsupervised learning, and have been successfully applied in various application 
domains, such as dimensionality reduction [3], object recognition [4] and others. 

In order to obtain efficient and exact inference, there are no connections within the hidden layer in 
RBMs. By considering statistical dependencies among the states of hidden units we may learn a more 
powerful generative model [5], although directly learning horizontal connections in hidden layers 
leads to inefficient inference. To consider, to some extent, the statistical dependencies among the states of 
hidden units while keeping the exact and efficient inference of RBMs, we introduce an $l_1/l_2$ 
regularizer on the activation probabilities of hidden units given training data. 

$l_1/l_2$ regularizers have been intensively studied in both the statistics community [6] and the machine 
learning community [7]. Usually the $l_1/l_2$ regularizer (or group lasso) only leads to sparsity at the 
group level but not within a group. Friedman et al. [8] present a more general regularizer that blends 
the $l_1$ and $l_1/l_2$ regularizers. Using this regularizer, the linear model can yield sparsity both at the group 
level and within the group. 

In this paper, we show that introducing the $l_1/l_2$ regularizer on the activation probabilities can yield 
sparsity at not only the group level but also the hidden unit level because of the logistic differential 
equation. In the experiment section, we show that the sparse group RBM can be a better and sparser 
generative model than regular RBMs, and that the sparse group RBM can also learn more discriminative 
features. 



2 Restricted Boltzmann Machines and Contrastive Divergence 

A Restricted Boltzmann Machine is a two layer neural network with one visible layer representing 
observed data and one hidden layer as feature detectors. Connections only exist between the visible 
layer and the hidden layer. Here we assume that both the visible and hidden units of the RBM are 
binary. The models below can be easily generalized to other types of units [9]. The energy function 
of an RBM is defined as 

$$E(x, h) = -\sum_{i,j} x_i h_j w_{ij} - \sum_i x_i b_i - \sum_j h_j c_j \tag{1}$$

where $x_i$ and $h_j$ denote the states of the ith visible unit and the jth hidden unit, while $w_{ij}$ represents 
the strength of the connection between them. $b_i$ and $c_j$ are the biases of the visible and hidden units 
respectively. 

Based on the energy function, we can define the joint distribution of $(x, h)$ 

$$P(x, h) = \frac{\exp(-E(x, h))}{Z} \tag{2}$$

where $Z$ is the partition function, $Z = \sum_{x,h} \exp(-E(x, h))$. 
The activation probabilities of the hidden units are 

$$P(h_j = 1 \mid x) = \mathrm{sigmoid}(x^T w_{\cdot j}) = \frac{1}{1 + \exp(-x^T w_{\cdot j})} \tag{3}$$

where $w_{\cdot j}$ denotes the jth column of $W$, which contains the connection weights between the jth hidden 
unit and all visible units. Because $x^T w_{\cdot j} = \cos(\alpha) \|w_{\cdot j}\|_2 \|x\|_2$ (where $\alpha$ is the angle between the vectors $w_{\cdot j}$ 
and $x$), the activation probability can be interpreted as the similarity between $x$ and the feature $w_{\cdot j}$ 
in data space. 
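
As a minimal sketch of Equation 3, the following NumPy snippet computes the hidden activation probabilities and checks the cosine identity numerically. The weights, the data vectors and the helper names `sigmoid` and `hidden_probs` are illustrative assumptions, and biases are omitted as in Equation 3.

```python
import numpy as np

def sigmoid(a):
    # Logistic function applied elementwise.
    return 1.0 / (1.0 + np.exp(-a))

def hidden_probs(X, W):
    # Row l, column j holds P(h_j = 1 | x^(l)) = sigmoid(x^(l)T w_.j).
    return sigmoid(X @ W)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 3))      # 6 visible units, 3 hidden units (toy sizes)
X = np.array([[1., 0., 1., 1., 0., 1.],
              [0., 1., 0., 0., 1., 0.]])    # two binary data vectors
P = hidden_probs(X, W)

# Check x^T w = cos(alpha) * ||w||_2 * ||x||_2 for one data vector and one feature.
x, w = X[0], W[:, 0]
cos_alpha = (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))
lhs = x @ w
rhs = cos_alpha * np.linalg.norm(w) * np.linalg.norm(x)
```

For a fixed feature norm, a data vector that points in the direction of $w_{\cdot j}$ therefore receives a larger activation probability than one that is orthogonal to it.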

The activation probabilities of the visible units are 

$$P(x_i = 1 \mid h) = \mathrm{sigmoid}(w_{i \cdot} h) \tag{4}$$

The marginal distribution over the visible units is in fact a product of experts model [1,2] 

$$P(x) = \frac{\prod_j (1 + \exp(x^T w_{\cdot j}))}{Z} \tag{5}$$

From Equation 5 we can deduce that each expert (hidden unit) contributes probability according 
to the similarity between its feature and the data vector $x$. The feature of the jth expert (the jth 
hidden unit), $w_{\cdot j}$, can be seen as a prototype defined in data space [2]. 

The objective of generative training of an RBM is to model the marginal distribution of the visible 
units $P(x)$. To do this, we need to compute the gradient of the training data likelihood with respect to a model parameter $\theta$ 

$$\frac{\partial \log P(x^{(n)})}{\partial \theta} = -\left\langle \frac{\partial E(x^{(n)}, h)}{\partial \theta} \right\rangle_{P(h \mid x^{(n)})} + \left\langle \frac{\partial E(x, h)}{\partial \theta} \right\rangle_{P(x, h)} \tag{6}$$

where $\langle \cdot \rangle_P$ is the expectation with respect to the distribution $P$. The second term of Equation 
6 is intractable since sampling the distribution $P(x, h)$ requires prolonged Gibbs sampling. Hinton 
[1] shows that we can get very good approximations to the second term when running the Gibbs 
sampler for only k steps (the most commonly chosen value of k is 1), initialized from the training data. 
Named Contrastive Divergence (CD) [1], the algorithm updates the feature of the jth hidden unit 
after seeing the training data set $\{x^{(1)}, x^{(2)}, \ldots, x^{(L)}\}$ 

$$\Delta w_{\cdot j} = \frac{1}{L} \sum_{l=1}^{L} \left[ P(h_j = 1 \mid x^{(l)}) \cdot x^{(l)} - P(h_j = 1 \mid x^{(l)-}) \cdot x^{(l)-} \right] \tag{7}$$

where $x^{(l)-}$ is sampled from $P(x \mid h^{(l)})$ ($h^{(l)}$ is sampled from $P(h \mid x^{(l)})$). The first term of Equation 7 
will decrease energies of the RBM at $\{x^{(1)}, x^{(2)}, \ldots, x^{(L)}\}$. The second term will increase energies 
at $\{x^{(1)-}, x^{(2)-}, \ldots, x^{(L)-}\}$, which are likely to be near the training data set and have low energies in the 
RBM. Iteratively updating parameters ensures that the probability distribution defined by the energy 
function has the desired local structure. Furthermore, from Equation 7, one specific hidden unit will 
be responsible for decreasing (or increasing) the energies of only a subset of the training data (or negative samples), 
which is selected by the vector of activation probabilities, $(P(h_j = 1 \mid x^{(1)}), \ldots, P(h_j = 1 \mid x^{(L)}))^T$ 
(or $(P(h_j = 1 \mid x^{(1)-}), \ldots, P(h_j = 1 \mid x^{(L)-}))^T$). Thus the learning algorithm ensures that different 
hidden units learn different features. 
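
The update of Equation 7 with k = 1 can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' implementation: biases are omitted, the negative samples $x^{(l)-}$ are drawn as binary vectors, and the function name `cd1_delta_w` and the toy shapes are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_delta_w(X, W, rng):
    """CD-1 estimate of Delta W; column j is Delta w_.j of Equation 7.

    X: (L, n_visible) binary training data, W: (n_visible, n_hidden).
    """
    L = X.shape[0]
    p_h_data = sigmoid(X @ W)                                         # P(h_j = 1 | x^(l))
    h_sample = (rng.random(p_h_data.shape) < p_h_data).astype(float)  # h^(l) ~ P(h | x^(l))
    p_x_neg = sigmoid(h_sample @ W.T)                                 # P(x_i = 1 | h^(l))
    X_neg = (rng.random(p_x_neg.shape) < p_x_neg).astype(float)       # x^(l)- ~ P(x | h^(l))
    p_h_neg = sigmoid(X_neg @ W)                                      # P(h_j = 1 | x^(l)-)
    # Positive term lowers energies at the data, negative term raises them at the samples.
    return (X.T @ p_h_data - X_neg.T @ p_h_neg) / L

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(8, 5)).astype(float)
W = rng.normal(scale=0.1, size=(5, 4))
W = W + 0.1 * cd1_delta_w(X, W, rng)   # one gradient step with an assumed learning rate of 0.1
```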



3 Sparse group RBMs 

As discussed in Section 1, directly learning the statistical dependencies between hidden units is 
inefficient. To alleviate this problem, we first divide the hidden units evenly into predefined non-overlapping 
groups to restrain the dependencies within these groups. Second, instead of directly 
learning the dependencies, we penalize the overall activation level of a group to force hidden units 
in the group to compete with each other. To implement these two intuitions we introduce a 
mixed-norm ($l_1/l_2$) regularizer on the activation probabilities of hidden units given the training data. 

Assuming an RBM has $F$ hidden units, let $\mathcal{H}$ denote the set of all hidden units' indices: $\mathcal{H} = 
\{1, 2, \ldots, F\}$. The kth group is denoted by $\mathcal{G}_k$ where $\mathcal{G}_k \subset \mathcal{H}$, $k = 1, \ldots, K$. Given a grouping $\mathcal{G}$ and 
a data point $x^{(l)}$, the kth group norm $N_k$ is given by 

$$N_k = \sqrt{\sum_{m \in \mathcal{G}_k} P(h_m = 1 \mid x^{(l)})^2} \tag{8}$$

which is the Euclidean ($l_2$) norm of the vector composed of these activation probabilities and is 
considered as the overall activation level of the kth group. Given the group norms, the mixed norm is 

$$\sum_{k=1}^{K} |N_k| = \sum_{k=1}^{K} \sqrt{\sum_{m \in \mathcal{G}_k} P(h_m = 1 \mid x^{(l)})^2} \tag{9}$$

which is the $l_1$ norm of the vector composed of the group norms. 
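
Equations 8 and 9 can be illustrated with a short NumPy sketch. It assumes, for illustration only, that the groups are contiguous blocks of equal size in the hidden unit ordering; the function names `group_norms` and `mixed_norm` are hypothetical.

```python
import numpy as np

def group_norms(P, group_size):
    """Group norms N_k (Equation 8) for each data point.

    P: (L, F) activation probabilities; returns an (L, K) array with K = F / group_size.
    """
    L, F = P.shape
    if F % group_size != 0:
        raise ValueError("F must be divisible by group_size")
    grouped = P.reshape(L, F // group_size, group_size)
    return np.sqrt((grouped ** 2).sum(axis=2))

def mixed_norm(P, group_size):
    # l1 norm over the (non-negative) group norms, one value per data point (Equation 9).
    return group_norms(P, group_size).sum(axis=1)

P = np.array([[0.0, 0.0, 0.6, 0.8],
              [1.0, 0.0, 0.0, 0.0]])
N = group_norms(P, group_size=2)
penalty = mixed_norm(P, group_size=2)
```

In this toy example each data point activates a single group, so one group norm is zero and the other equals 1.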

We add the $l_1/l_2$ regularizer to the log-likelihood of training data (see Equation 6). Thus, given 
training data $\{x^{(1)}, x^{(2)}, \ldots, x^{(L)}\}$, we need to solve the following optimization problem 

$$\underset{W, b, c}{\text{maximize}} \; \sum_{l=1}^{L} \left[ \log P(x^{(l)}) - \lambda \sum_{k=1}^{K} \sqrt{\sum_{j \in \mathcal{G}_k} P(h_j = 1 \mid x^{(l)})^2} \right] \tag{10}$$

where $\mathcal{G}_k$ is the index set belonging to the kth group of hidden units and $\lambda$ is a regularization 
constant. 

The effects of $l_1/l_2$ regularization can be interpreted on two levels: an across-group and a within-group 
level. On the across-group level, the group norms $N_k$ behave as if they were penalized by an $l_1$ 
norm. In consequence, given observed data some group norms are zero, which means the activation 
probabilities of all hidden units in these groups are zero since the activation probabilities are non-negative. 
On the within-group level, the $l_2$ norm penalizes the activation probabilities of 
all hidden units in the same group equally. In other words, the $l_2$ norm does not yield sparsity within the 
group. However, when the $l_1/l_2$ regularizer is applied to the activation probabilities, the situation is entirely different 
because of the logistic differential equation. We discuss this in detail below. 

Figure 1: Assuming the jth hidden unit belongs to the kth group and $q = \sum_{m \in \mathcal{G}_k, m \neq j} P(h_m = 
1 \mid x^{(l)})^2$. The regularization constant $\lambda$ is set to 1. In both figures the horizontal axis represents 
the activation of the jth hidden unit. (a) Vertical axis: the coefficients of data $x^{(l)}$; (b) Vertical axis: 
the values of the second term in the coefficients of data $x^{(l)}$. 



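By introducing this regularizer, Equation 7 is changed to the following equation 

$$\Delta w_{\cdot j} = \frac{1}{L} \sum_{l=1}^{L} \left[ \left( P(h_j = 1 \mid x^{(l)}) - \lambda \frac{P(h_j = 1 \mid x^{(l)})^2 \, P(h_j = 0 \mid x^{(l)})}{\sqrt{\sum_{m \in \mathcal{G}_k} P(h_m = 1 \mid x^{(l)})^2}} \right) \cdot x^{(l)} - P(h_j = 1 \mid x^{(l)-}) \cdot x^{(l)-} \right] \tag{11}$$

where $j \in \mathcal{G}_k$. As discussed in Section 2, the decrease of the RBM's energies at the training data 
$x^{(l)}$ is determined by $P(h_j = 1 \mid x^{(l)})$ in Equation 7. With the $l_1/l_2$ regularization the decrease is 
determined not only by $P(h_j = 1 \mid x^{(l)})$ but also by the activation probabilities of the other hidden units in the 
same group. To interpret the properties of this regularizer, we visualize the coefficients of data point 
$x^{(l)}$ in Figure 1(a) and the second term of the coefficients in Figure 1(b). 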
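
The penalty term in the coefficient of $x^{(l)}$ can be recovered by differentiating the group norm of Equation 8 with respect to $w_{\cdot j}$, using the logistic differential equation for the sigmoid. The short derivation below is a sketch under the assumptions that biases are omitted (as in Equation 3) and that $N_k > 0$; the shorthand $p_j$ is introduced here for readability.

```latex
\begin{align*}
p_j &= P(h_j = 1 \mid x^{(l)}) = \mathrm{sigmoid}(x^{(l)T} w_{\cdot j}),
\qquad
\frac{\partial p_j}{\partial w_{\cdot j}} = p_j (1 - p_j)\, x^{(l)}, \\
\frac{\partial N_k}{\partial w_{\cdot j}}
 &= \frac{p_j}{\sqrt{\sum_{m \in \mathcal{G}_k} p_m^2}} \cdot \frac{\partial p_j}{\partial w_{\cdot j}}
  = \frac{p_j^2 \,(1 - p_j)}{\sqrt{\sum_{m \in \mathcal{G}_k} p_m^2}}\; x^{(l)},
\qquad j \in \mathcal{G}_k .
\end{align*}
```

Since $1 - p_j = P(h_j = 0 \mid x^{(l)})$, multiplying by $\lambda$ and subtracting from the data term of Equation 7 gives the coefficient described above.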

$q = \sum_{m \in \mathcal{G}_k, m \neq j} P(h_m = 1 \mid x^{(l)})^2$ can be interpreted as the overall activation level of the hidden 
units in the kth group except the jth hidden unit. A small value of $q$ indicates that most of the hidden units 
in the group are inactive for the data $x^{(l)}$. In Figure 1(a), a smaller $q$ slows down the decrease of 
the RBM's energies at data point $x^{(l)}$. In consequence, a hidden unit is penalized strongly if the 
activation probabilities of the remaining hidden units in the corresponding group are very small. 
Thus, the first property of the $l_1/l_2$ regularizer is that it encourages few groups to be active given 
observed data. This property yields sparsity at the group level. 



In Figure 1(b), the effect of the regularizer vanishes when $P(h_j = 1 \mid x^{(l)})$ is close to 0 or 1 
because of the factor $P(h_j = 1 \mid x^{(l)})^2 P(h_j = 0 \mid x^{(l)})$ in Equation 11. A bigger $q$ indicates that most 
of the hidden units in the group are active for the data $x^{(l)}$. In Figure 1(b) it can also be seen that 
the effect of the regularizer is smaller when the activation probabilities are close to 0 than 
when they are close to 1, because of the square of $P(h_j = 1 \mid x^{(l)})$. This can be interpreted as hidden 
units in a group competing with each other for modeling data $x^{(l)}$. Only a few of the hidden units in a group 
will win this competition. Thus, the second property of the regularizer is that it results in only a few 
hidden units being active in a group. This property yields sparsity within the group. Based on 
these two properties, we call RBMs trained by Equation 10 sparse group RBMs. 
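
The two properties can be checked numerically with a small sketch of the coefficient of $x^{(l)}$ in Equation 11. The function names `penalty_term` and `data_coefficient`, the default $\lambda = 1$ (as in Figure 1) and the small constant guarding against division by zero are assumptions made for illustration.

```python
import numpy as np

def penalty_term(p, q, lam=1.0):
    """Second term of the coefficient of x^(l) in Equation 11.

    p: P(h_j = 1 | x^(l)); q: sum of squared activation probabilities
    of the other hidden units in the same group.
    """
    p = np.asarray(p, dtype=float)
    group_norm = np.sqrt(q + p ** 2)
    # Guard against 0/0 when the whole group is inactive (p = 0 and q = 0).
    return lam * p ** 2 * (1.0 - p) / np.maximum(group_norm, 1e-12)

def data_coefficient(p, q, lam=1.0):
    # Full coefficient of x^(l): activation probability minus the penalty.
    return np.asarray(p, dtype=float) - penalty_term(p, q, lam)

# Group-level effect: a smaller q gives a stronger penalty at the same p.
strong = penalty_term(0.5, q=0.01)
weak = penalty_term(0.5, q=2.0)

# Unit-level effect: the penalty vanishes at p = 0 and p = 1 and is weaker near 0 than near 1.
near_zero = penalty_term(0.1, q=1.0)
near_one = penalty_term(0.9, q=1.0)
```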



4 Relationship to third-order RBMs 

A third-order RBM described in [10] is a mixture model whose components are RBMs. To a certain 
extent, a trained third-order RBM defines a special group sparse representation for training data. 
Discussing the relationship between third-order RBMs and sparse group RBMs gives us additional 
insight into the effects of the $l_1/l_2$ regularizer for RBMs. 

The energy function of a third-order Boltzmann machine is 

$$E(x, h, z) = -\sum_{i,j,k} x_i h^k_j w^k_{ij} z_k \tag{12}$$

where $z$ is a $K$-dimensional binary vector with 1-of-$K$ activation and represents the cluster label. 
The responsibility of the kth component RBM is 

$$P(z_k = 1 \mid x) = \frac{P(x, z_k = 1)}{\sum_{l=1}^{K} P(x, z_l = 1)} = \frac{\prod_j (1 + \exp(x^T w^k_{\cdot j}))}{\sum_{l=1}^{K} \prod_j (1 + \exp(x^T w^l_{\cdot j}))} = \frac{\prod_j \left( 1 / (1 - P(h^k_j = 1 \mid x)) \right)}{\sum_{l=1}^{K} \prod_j \left( 1 / (1 - P(h^l_j = 1 \mid x)) \right)} \tag{13}$$

A third-order RBM with $K$ components can be seen as a regular RBM in which hidden units are 
divided into $K$ non-overlapping groups. Given a data vector $x$, the responsibility $P(z \mid x)$ is used to pick 
one group's hidden units to respond to the data. In other words, the data will be represented by only 
one group's hidden units and the states of hidden units in other groups will be set to 0. From this 
perspective, a third-order RBM yields a special group sparsity given the training data. 
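
The responsibilities of Equation 13 can be computed as a softmax over per-group sums of $\log(1 + \exp(\cdot))$ terms. The sketch below works in log space to avoid the overflow that the raw products can cause; the array layout of `W_groups` and the function name `responsibilities` are illustrative assumptions, and biases and the temperature parameter of [10] are omitted.

```python
import numpy as np

def responsibilities(x, W_groups):
    """P(z_k = 1 | x) for all K component RBMs (Equation 13).

    x: (n_visible,) data vector.
    W_groups: (K, n_visible, n_hidden_per_group), one weight matrix per component.
    """
    acts = np.einsum("i,kij->kj", x, W_groups)           # x^T w^k_.j for every k and j
    log_scores = np.logaddexp(0.0, acts).sum(axis=1)     # sum_j log(1 + exp(x^T w^k_.j))
    log_scores = log_scores - log_scores.max()           # stabilise before exponentiating
    scores = np.exp(log_scores)
    return scores / scores.sum()

rng = np.random.default_rng(1)
W_groups = rng.normal(scale=0.5, size=(3, 6, 4))         # K = 3 groups of 4 hidden units
x = np.array([1., 0., 1., 1., 0., 1.])
r = responsibilities(x, W_groups)
```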

The group (component) which has the bigger value of $\prod_j 1 / (1 - P(h_j = 1 \mid x))$ will more likely be 
responsible for the data $x$. Given the data $x$, the product $\prod_j 1 / (1 - P(h_j = 1 \mid x))$ can be interpreted 
as a measure of the overall activation level of the hidden units in the group. If more hidden units in the group 
are active, the overall activation level of the group is higher. However, the products are unbounded 
and at very different numerical scales, since any hidden unit's activation probability $P(h_j = 1 \mid x)$ 
in a group that is close to 1 will make the product extremely big. To alleviate this problem, Nair and 
Hinton [10] introduced a temperature parameter $T$ to reduce scale differences in the products. 

There are two major differences between third-order RBMs and sparse group RBMs. First, sparse 
group RBMs define a different overall activation level of a group's hidden units, namely the Euclidean 
norm of the vector $(P(h_j = 1 \mid x))_{j \in \mathcal{G}_k}$. Since this measure is bounded and lies in the interval 
$[0, \sqrt{|\mathcal{G}_k|}]$, a single group with a very high overall activation level cannot shield all of the other 
groups. Second, as discussed in Section 3, sparse group RBMs yield sparsity at both the group 
level and the hidden unit level through regularization. This is more flexible than the approach taken by 
third-order RBMs, which directly divide the training data into subsets by $P(z \mid x)$, where 
each subset is modeled by the hidden units in one specific group. 



5 Sparse group deep Boltzmann machines 

Salakhutdinov and Hinton [11] presented a learning algorithm for tractably training deep multilayer 
Boltzmann machines, in which, unlike deep belief networks, hidden units receive top-down 
feedback. Based on their learning algorithm, we illustrate that the regularizer can also be added to a 
deep Boltzmann machine. This leads to a sparse group deep Boltzmann machine. Taking a two-layer 
Boltzmann machine as an example, the energy function is 

$$E(x, h^1, h^2) = -x^T W^1 h^1 - (h^1)^T W^2 h^2 \tag{14}$$

For training a sparse group deep Boltzmann machine, we propose the following optimization problem 

$$\underset{W^1, W^2}{\text{maximize}} \; \sum_{l=1}^{L} \left[ \log P(x^{(l)}) - \lambda \sum_{k=1}^{K} \sqrt{\sum_{j \in \mathcal{G}^1_k} P(h^1_j = 1 \mid x^{(l)})^2} - \lambda \sum_{k'=1}^{K'} \sqrt{\sum_{j \in \mathcal{G}^2_{k'}} P(h^2_j = 1 \mid x^{(l)})^2} \right] \tag{15}$$

where $\mathcal{G}^1_k$ and $\mathcal{G}^2_{k'}$ denote the groups of the first and second hidden layers. 
Given observed data, the two activation probabilities cannot be computed efficiently. Thus, following 
[11], we adopted the mean-field approximations for these two probabilities. 
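
A mean-field approximation of the two activation probabilities can be sketched as a fixed-point iteration in which the first hidden layer receives both bottom-up input and top-down feedback. This is only an illustrative sketch in the spirit of [11]: it assumes the second energy term couples the two hidden layers, omits biases, and the initialisation, the number of iterations and the function name `mean_field` are assumptions rather than settings reported in the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field(x, W1, W2, n_iters=50):
    """Approximate P(h1_j = 1 | x) and P(h2_j = 1 | x) for a two-layer Boltzmann machine.

    x: (n_visible,), W1: (n_visible, n_hidden1), W2: (n_hidden1, n_hidden2).
    """
    mu1 = sigmoid(x @ W1)       # bottom-up initialisation of the first hidden layer
    mu2 = sigmoid(mu1 @ W2)
    for _ in range(n_iters):
        mu1 = sigmoid(x @ W1 + mu2 @ W2.T)   # bottom-up input plus top-down feedback
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=7).astype(float)
W1 = rng.normal(scale=0.1, size=(7, 5))
W2 = rng.normal(scale=0.1, size=(5, 3))
mu1, mu2 = mean_field(x, W1, W2)
```

The resulting `mu1` and `mu2` would take the place of the exact activation probabilities when evaluating the two group penalties.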

6 Experiments 



We first applied sparse group RBMs to three tasks: modeling patches of natural images, modeling 
handwritten digits and pretraining a multilayer feedforward network for handwritten digit 
recognition. The first two tasks are used to evaluate the performance of sparse group RBMs 
(as generative models) on modeling different types (real-valued and binary) of observed data. The 
third task partially assesses the performance of features learned by sparse group RBMs on a 
discriminative task. We also trained a sparse group deep Boltzmann machine for handwritten digit 
recognition. 

A bigger group easily leads to a bigger $q$ (see Section 3), which makes hidden units in this 
group receive weaker penalties and moderates the competition among hidden units in this group. 
Meanwhile, using a bigger $\lambda$ can keep the competition intense even though $q$ is big. However, 
a big $\lambda$ may lead to negative coefficients of $x^{(l)}$ in Equation 11 and prevent hidden units from 
modeling $x^{(l)}$ when $q$ is small. So the group size needs to be set small (empirically below 10). 
In all experiments, the sparse group RBMs' parameter $\lambda$ is empirically set to 0.1, which ensures that the 
regularizer does not dominate the learning. 
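
The sign condition can be made explicit with a short worked bound. The shorthand $c(p, q)$ for the coefficient of $x^{(l)}$ in Equation 11 is introduced here for illustration, with $p = P(h_j = 1 \mid x^{(l)})$ and $q$ as in Section 3.

```latex
\begin{align*}
c(p, q) &= p - \lambda \frac{p^2 (1 - p)}{\sqrt{q + p^2}},
\qquad p \in (0, 1),\; q \ge 0, \\
c(p, 0) &= p \bigl( 1 - \lambda (1 - p) \bigr) < 0
\iff \lambda > \frac{1}{1 - p}, \\
c(p, q) &\ge c(p, 0) = p\,(0.9 + 0.1\,p) \ge 0.9\,p > 0
\quad \text{for } \lambda = 0.1 .
\end{align*}
```

Under these assumptions the coefficient can only become negative when $\lambda > 1$, and with $\lambda = 0.1$ the regularizer removes at most 10% of the data term, which is consistent with the statement that it does not dominate the learning.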

6.1 Modeling patches of natural images 

Regular RBMs trained on patches of natural images learn relatively diffuse, unlocalized 
features. Lee et al. [12] proposed sparse RBMs to model natural images. Because sparse group 
RBMs yield sparsity at the hidden unit level, we show that sparse group RBMs can also be used for 
modeling patches of natural images. 

The training data consists of 100,000 14 x 14 patches randomly extracted from a standard set 
of ten 512 x 512 whitened images as in [13]. We divided all patches into mini-batches, each of which 
contained 200 patches, and updated the weights after each mini-batch. 




Figure 2: Learned features with sparse group RBM trained on patches of natural images 

Figure 3: Learned features with sparse RBM trained on patches of natural images 

We trained a sparse group RBM with 196 real-valued visible units and 400 hidden units, which are 
divided into 80 uniform non-overlapping groups. There are 5 hidden units in each group. The 
learned features are shown in Figure 2. The features are localized, oriented, Gabor-like filters. 

For comparison, we also trained a sparse RBM [12] with 400 hidden units. The learned features are 
shown in Figure 3. The sparse RBM's parameters, $p$ and $\lambda$, are set to 50 and 0.02 as suggested in 
[12]. As discussed in Section 3, hidden units in a group compete with each other to model patches, so 
each hidden unit in the sparse group RBM focuses on modeling more subtle patterns contained in the 
training data. As a result, the features learned with the sparse group RBM are much more localized 
than those learned with the sparse RBM. 



6.2 Modeling handwritten digits 

We also applied the sparse group RBM algorithm to the MNIST handwritten digit dataset.¹ The training 
data, 60,000 28 x 28 images, were divided into mini-batches, each of which contained 100 
images. We trained two sparse group RBMs with 600 hidden units and different group sizes (3 and 
5). We compare these models to a regular RBM with 600 hidden units, whose learned features are 
shown in Figure 4. All three models are trained by CD-1 for 50 epochs with the same configuration 
of learning rate, weight decay and momentum. 

Due to space reasons, we show only the features of the sparse group RBM with group size 3 in 
Figure 5. The results of the sparse group RBM with group size 5 are similar. Many features in 
Figure 5 look like different strokes of handwritten digits. 



Figure 4: Learned features with RBM trained on MNIST dataset 




Figure 5: Learned features with sparse group RBM trained on MNIST dataset. 3 patches in one row 
of every column are visualizations of features of hidden units in one specific group. 

Although computing the exact partition function of an RBM is intractable, Salakhutdinov and Murray 
[14] proposed an Annealed Importance Sampling based algorithm to tractably approximate the 
partition function of an RBM. Using their method, the estimates of the lower bound on the average 
test log-probability are -123 for the regular RBM, -104 for the sparse group RBM with group 
size 3 and -111 for the sparse group RBM with group size 5. It can be seen that by adopting a 
proper regularization we can learn a better generative model on the MNIST dataset than without 
the regularization. 

We use Hoyer's sparseness measure [15] to quantify how sparse the representations learned by RBMs 
and sparse group RBMs are. This sparseness measure evaluates the sparseness of a $D$-dimensional vector 
$v$ in the following way 

$$\mathrm{sparseness}(v) = \frac{\sqrt{D} - \left( \sum_i |v_i| \right) / \sqrt{\sum_i v_i^2}}{\sqrt{D} - 1} \tag{16}$$

This measure has good properties: it lies in the interval [0, 1] and is on a normalized scale. A 
value closer to 1 means that there are more zero components in the vector $v$. With each trained 
model, we can compute the activation probabilities of hidden units given 10,000 test images. For any 
trained model this leads to new representations (10,000 600-dimensional vectors) of the test data. The 
sparseness measures of the representations under the regular RBM are in the interval [0.36, 0.6], with an 
average of 0.5. The sparseness measures of the representations under the sparse group RBMs with 
group size 3 and 5 are in [0.55, 0.75] and [0.51, 0.72]. The averages are 0.68 and 0.65, respectively. 
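
Hoyer's measure is simple to compute. The sketch below is an illustrative NumPy version; the function name `hoyer_sparseness` is hypothetical, and it assumes a vector with at least two components that is not all zeros.

```python
import numpy as np

def hoyer_sparseness(v):
    """Hoyer's sparseness of a D-dimensional vector (D > 1, v not all zeros)."""
    v = np.asarray(v, dtype=float)
    D = v.size
    l1 = np.abs(v).sum()
    l2 = np.sqrt((v ** 2).sum())
    return (np.sqrt(D) - l1 / l2) / (np.sqrt(D) - 1.0)

one_active = hoyer_sparseness([0.0, 0.0, 1.0, 0.0])   # a single non-zero component
all_equal = hoyer_sparseness([0.5, 0.5, 0.5, 0.5])    # all components equal
```

The two toy vectors sit at the two ends of the scale: the one-hot vector gives a sparseness of 1 and the constant vector gives 0. Applied to each row of a matrix of hidden activation probabilities, the function yields one sparseness value per test image.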



¹ http://yann.lecun.com/exdb/mnist/ 

It can be seen that the sparse group RBMs learn much sparser representations than regular 
RBMs on the MNIST dataset. Figure 6(a) visualizes the activation probabilities of hidden units, which 
are computed under the regular RBM given an image from the test set. Given the same image, the 
activation probabilities computed under the sparse group RBM are shown in Figure 6(b). 





Figure 6: (a) Activation probabilities computed under the regular RBM. The sparseness of the 
vector is 0.49; (b) Activation probabilities computed under the sparse group RBM with group size 
3. The sparseness is 0.68. 



6.3 Using sparse group RBMs to pretrain deep networks 

One of the most important applications of RBMs is to use them as building blocks to greedily 
pretrain deep supervised feedforward neural networks layer by layer [3]. We show that sparse group 
RBMs can also be used to initialize deep networks and achieve better classification performance 
on the MNIST dataset. 

We use sparse group RBMs with different group sizes (3, 5 and 10) on the MNIST dataset to pretrain 
784-600-600-2100 networks. After pretraining, the multilayer networks are fine-tuned for 30 
iterations using Conjugate Gradient. The networks initialized by the sparse group RBMs with group 
sizes 3, 5 and 10 achieve error rates of 0.89%, 0.91% and 0.99%, respectively. Using regular 
RBMs to pretrain a 784-500-500-2000 network achieved an error rate of 1.14% [3]. A network 
with the same architecture initialized by sparse RBMs gave a much worse error rate of 1.81% [16]. 



6.4 Sparse group deep Boltzmann machines 

We also trained a two-layer (500 and 1000 hidden units) sparse group Boltzmann machine on the MNIST 
dataset. The group size is set to 10 for both layers. First, we use two sparse group RBMs 
(784-500 and 500-1000) to initialize the deep network. Then the learning algorithm described in 
Section 5 trains the sparse group deep Boltzmann machine. Finally we discriminatively fine-tuned the 
network. The sparse group deep Boltzmann machine achieves an error rate of 0.84% on the test 
set, which is, to our knowledge, the best published result on the permutation-invariant version of 
the MNIST task. The deep Boltzmann machine with the same architecture without the sparse group 
regularization resulted in an error rate of 0.95% [11]. 



7 Conclusions 

In this paper, we introduce $l_1/l_2$ regularization on the activation probabilities of hidden units in 
restricted Boltzmann machines. This leads to a better and sparser generative model, the sparse group 
RBM. 